caret Package (Tutorial)
# load data
library(kernlab); data(spam); set.seed(333)
# picking a small subset (10 values) from spam data set
smallSpam <- spam[sample(dim(spam)[1],size=10),]
# label spam = 2 and ham = 1 (for coloring)
spamLabel <- (smallSpam$type=="spam")*1 + 1
# plot the capitalAve values (= capital letters) for the dataset with colors differentiated by spam/ham (2 vs 1)
plot(smallSpam$capitalAve,col=spamLabel)

# first prediction rule (over-fitting to capture all variation)
rule1 <- function(x){
prediction <- rep(NA,length(x))
prediction[x > 2.7] <- "spam"
prediction[x < 2.40] <- "nonspam"
prediction[(x >= 2.40 & x <= 2.45)] <- "spam"
prediction[(x > 2.45 & x <= 2.70)] <- "nonspam"
return(prediction)
}
# tabulate results of prediction algorithm 1 (in sample error -> no error in this case)
table(rule1(smallSpam$capitalAve),smallSpam$type)
##
## nonspam spam
## nonspam 5 0
## spam 0 5
# second rule (simple, setting a threshold)
rule2 <- function(x){
prediction <- rep(NA,length(x))
prediction[x > 2.8] <- "spam"
prediction[x <= 2.8] <- "nonspam"
return(prediction)
}
# tabulate results of prediction algorithm 2 (in sample error -> 10% in this case)
table(rule2(smallSpam$capitalAve),smallSpam$type)
##
## nonspam spam
## nonspam 5 1
## spam 0 4
# tabulate out of sample error for algorithm 1
table(rule1(spam$capitalAve),spam$type)
##
## nonspam spam
## nonspam 2141 588
## spam 647 1225
# tabulate out of sample error for algorithm 2
table(rule2(spam$capitalAve),spam$type)
##
## nonspam spam
## nonspam 2224 642
## spam 564 1171
# accuracy and total correct for algorithm 1 and 2
rbind("Rule 1" = c(Accuracy = mean(rule1(spam$capitalAve)==spam$type),
"Total Correct" = sum(rule1(spam$capitalAve)==spam$type)),
"Rule 2" = c(Accuracy = mean(rule2(spam$capitalAve)==spam$type),
"Total Correct" = sum(rule2(spam$capitalAve)==spam$type)))## Accuracy Total Correct
## Rule 1 0.7315801 3366
## Rule 2 0.7378831 3395
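The accuracy comparison above can be reproduced for any rule with a couple of lines of base R. This is a minimal sketch using a hypothetical `threshold_rule` function and toy data (not the spam data set):

```r
# hypothetical threshold rule, analogous to rule2 above
threshold_rule <- function(x, cutoff = 2.8) {
  ifelse(x > cutoff, "spam", "nonspam")
}

# toy stand-ins for capitalAve values and the true labels
x     <- c(1.2, 3.5, 2.9, 0.4, 5.1)
truth <- c("nonspam", "spam", "nonspam", "nonspam", "spam")

pred <- threshold_rule(x)
mean(pred == truth)   # accuracy: proportion of correct predictions
## [1] 0.8
sum(pred == truth)    # total correct
## [1] 4
```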
caret Package

Key caret functions covered: preProcess(), createDataPartition(), createResample(), createTimeSlices(), train(), predict(), confusionMatrix()

* caret provides a uniform framework to build and predict using many different models
* caret allows algorithms to be run the same way through the predict() function

createDataPartition(y=data$var, times=1, p=0.75, list=FALSE) \(\rightarrow\) creates data partitions using the given variable

* y=data$var = specifies what outcome/variable to split the data on
* times=1 = specifies the number of partitions to create (number of data splits performed)
* p=0.75 = percent of data that will be used for training the model
* list=FALSE = returns a matrix of indices corresponding to p% of the data (training set)
  * list=FALSE is generally what is used
* list=TRUE = returns a list of indices corresponding to p% of the data (training set)
* training <- data[inTrain, ] = subsets the data to the training set only
* testing <- data[-inTrain, ] = the rest of the data set can then be stored as the test set

# load packages and data
library(caret); library(kernlab); data(spam)
# create training set indexes with 75% of data
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
# subset spam data to training
training <- spam[inTrain,]
# subset spam data (the rest) to test
testing <- spam[-inTrain,]
# dimension of original and training dataset
rbind("original dataset" = dim(spam),"training set" = dim(training))## [,1] [,2]
## original dataset 4601 58
## training set 3451 58
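createDataPartition samples within each level of the outcome, so the spam/nonspam proportions are preserved in the training set. A simplified base-R sketch of that stratified idea (a made-up helper, ignoring caret's exact rounding rules):

```r
set.seed(333)
# simplified stand-in for createDataPartition: sample p% within each level
stratified_partition <- function(y, p = 0.75) {
  idx <- lapply(split(seq_along(y), y),
                function(i) sample(i, size = floor(length(i) * p)))
  sort(unlist(idx))
}

y <- factor(rep(c("nonspam", "spam"), times = c(60, 40)))
inTrain <- stratified_partition(y, p = 0.75)
table(y[inTrain])   # 45 nonspam, 30 spam: same 60/40 split as the full data
```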
createFolds(y=data$var, k=10, list=TRUE, returnTrain=TRUE) = slices the data into \(k\) folds for cross validation and returns \(k\) lists of indices

* y=data$var = specifies what outcome/variable to split the data on
* k=10 = specifies the number of folds to create (see K Fold Cross Validation)
* list=TRUE = returns k lists of indices that correspond to the cross-validated sets
  * since the result is a list of k index vectors (folds in this case), you can use folds[[1]][1:10] to access different elements from that list
* list=FALSE = returns a vector indicating which of the k folds each data point belongs to (i.e. 1 to 10 is assigned for each of the data points in this case)
* [list=TRUE] returnTrain=TRUE = returns the indices of the training sets
* returnTrain=FALSE = returns the indices of the test sets

# create 10 folds for cross validation and return the training set indices
folds <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=TRUE)
# structure of the training set indices
str(folds)
## List of 10
## $ Fold01: int [1:4141] 1 2 3 4 6 7 8 9 10 12 ...
## $ Fold02: int [1:4140] 1 2 3 4 5 7 8 9 10 11 ...
## $ Fold03: int [1:4141] 1 3 4 5 6 7 8 9 10 11 ...
## $ Fold04: int [1:4141] 2 3 4 5 6 7 8 9 10 11 ...
## $ Fold05: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold06: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold07: int [1:4141] 1 2 3 4 5 6 8 9 11 12 ...
## $ Fold08: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold09: int [1:4141] 1 2 4 5 6 7 10 11 12 13 ...
## $ Fold10: int [1:4141] 1 2 3 5 6 7 8 9 10 11 ...
# return the test set indices instead
# note: returnTrain = FALSE is unnecessary as it is the default behavior
folds.test <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=FALSE)
str(folds.test)
## List of 10
## $ Fold01: int [1:460] 15 16 18 40 45 62 68 81 82 102 ...
## $ Fold02: int [1:459] 1 41 55 58 67 75 117 123 151 175 ...
## $ Fold03: int [1:461] 3 14 66 69 70 80 90 112 115 135 ...
## $ Fold04: int [1:460] 5 19 25 65 71 83 85 88 91 93 ...
## $ Fold05: int [1:460] 6 10 17 21 26 56 57 104 107 116 ...
## $ Fold06: int [1:459] 7 8 13 39 52 54 76 89 99 106 ...
## $ Fold07: int [1:461] 4 23 27 29 32 33 34 38 49 51 ...
## $ Fold08: int [1:460] 2 9 30 31 36 37 43 46 47 48 ...
## $ Fold09: int [1:461] 12 20 24 44 53 59 60 64 84 98 ...
## $ Fold10: int [1:460] 11 22 28 35 42 61 72 86 92 118 ...
# return first 10 elements of the first training set
folds[[1]][1:10]
## [1] 1 2 3 4 6 7 8 9 10 12
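Conceptually, createFolds just shuffles the row indices and deals them into k roughly equal groups (it also stratifies by outcome, which is omitted here). A base-R sketch with a made-up helper name:

```r
set.seed(1)
# simplified stand-in for createFolds(..., list=TRUE, returnTrain=FALSE)
make_folds <- function(n, k = 10) {
  fold_id <- sample(rep(1:k, length.out = n))  # shuffled fold labels
  split(seq_len(n), fold_id)                   # test-set indices per fold
}

toyFolds <- make_folds(n = 100, k = 10)
length(toyFolds)    # 10 folds
lengths(toyFolds)   # each fold holds n/k = 10 indices
# training indices for fold 1 are simply everything outside it
train1 <- setdiff(1:100, toyFolds[[1]])
```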
createResample(y=data$var, times=10, list=TRUE) = creates 10 resamples from the given data with replacement

* times=10 = number of samples to create
* list=TRUE = returns a list of n vectors that contain the indices of each sample

# create 10 resamples
resamples <- createResample(y=spam$type,times=10,list=TRUE)
# structure of the resamples (note some samples are repeated)
str(resamples)
## List of 10
## $ Resample01: int [1:4601] 1 4 4 4 7 8 12 13 13 14 ...
## $ Resample02: int [1:4601] 3 3 5 7 10 12 12 13 13 14 ...
## $ Resample03: int [1:4601] 1 2 2 3 4 5 8 10 11 12 ...
## $ Resample04: int [1:4601] 1 3 3 4 7 8 8 9 10 14 ...
## $ Resample05: int [1:4601] 2 4 5 6 7 7 8 8 9 12 ...
## $ Resample06: int [1:4601] 3 6 6 7 8 9 12 13 13 14 ...
## $ Resample07: int [1:4601] 1 2 2 5 5 6 7 8 9 10 ...
## $ Resample08: int [1:4601] 2 2 3 4 4 7 7 8 8 9 ...
## $ Resample09: int [1:4601] 1 4 7 8 8 9 12 13 15 15 ...
## $ Resample10: int [1:4601] 1 3 4 4 7 7 9 9 10 11 ...
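At its core, each bootstrap resample is just n draws of the row indices with replacement, which is why the same index can appear multiple times in the output above. A base-R sketch of that idea:

```r
set.seed(333)
n <- 20
# 10 bootstrap resamples of the row indices, drawn with replacement
boot_samples <- lapply(1:10, function(i) sort(sample(n, n, replace = TRUE)))
lengths(boot_samples)   # every resample has n indices; duplicates are expected
```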
createTimeSlices(y=data, initialWindow=20, horizon=10) = creates training sets with the specified window length and the corresponding test sets

* initialWindow=20 = number of consecutive values in each time slice/training set (i.e. values 1 - 20)
* horizon=10 = number of consecutive values in each prediction/test set (i.e. values 21 - 30)
* fixedWindow=FALSE = training sets always start at the first observation, so they grow with each slice
# create time series data
tme <- 1:1000
# create time slices
folds <- createTimeSlices(y=tme,initialWindow=20,horizon=10)
# name of lists
names(folds)
## [1] "train" "test"
# first training set
folds$train[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# first test set
folds$test[[1]]
## [1] 21 22 23 24 25 26 27 28 29 30
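The sliding windows can also be built directly in base R, which makes the initialWindow/horizon arithmetic explicit. A sketch with a made-up helper name (fixed window, no skipping):

```r
# simplified stand-in for createTimeSlices with fixedWindow=TRUE
time_slices <- function(n, initialWindow, horizon) {
  starts <- 1:(n - initialWindow - horizon + 1)
  list(train = lapply(starts, function(s) s:(s + initialWindow - 1)),
       test  = lapply(starts, function(s)
                 (s + initialWindow):(s + initialWindow + horizon - 1)))
}

slices <- time_slices(n = 1000, initialWindow = 20, horizon = 10)
slices$train[[1]]      # 1 2 ... 20
slices$test[[1]]       # 21 22 ... 30
length(slices$train)   # 971 slices in total
```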
train(y ~ x, data=df, method="glm") = function to apply the machine learning algorithm to construct a model from the training data

# returns the arguments of the default train function
args(caret:::train.default)
## function (x, y, method = "rf", preProcess = NULL, ..., weights = NULL,
## metric = ifelse(is.factor(y), "Accuracy", "RMSE"), maximize = ifelse(metric %in%
## c("RMSE", "logLoss", "MAE"), FALSE, TRUE), trControl = trainControl(),
## tuneGrid = NULL, tuneLength = ifelse(trControl$method ==
## "none", 1, 3))
## NULL
The train function has a large set of parameters; below are the default options

* method="rf" = default algorithm is random forest; caret contains a large number of algorithms
  * names(getModelInfo()) = returns all the options for the method argument
* preProcess=NULL = set preprocessing options (see Preprocessing)
* weights=NULL = can be used to add weights to observations, useful for unbalanced distributions (a lot more of one type than another)
* metric=ifelse(is.factor(y), "Accuracy", "RMSE") = default metric is Accuracy for factor variables, and RMSE (root mean squared error) for continuous variables
* maximize=ifelse(metric=="RMSE", FALSE, TRUE) = the algorithm should maximize Accuracy and minimize RMSE
* trControl=trainControl() = training controls for the model, more details below
* tuneGrid=NULL
* tuneLength=3

# returns the default arguments for the trainControl object
args(trainControl)
## function (method = "boot", number = ifelse(grepl("cv", method),
## 10, 25), repeats = ifelse(grepl("[d_]cv$", method), 1, NA),
## p = 0.75, search = "grid", initialWindow = NULL, horizon = 1,
## fixedWindow = TRUE, skip = 0, verboseIter = FALSE, returnData = TRUE,
## returnResamp = "final", savePredictions = FALSE, classProbs = FALSE,
## summaryFunction = defaultSummary, selectionFunction = "best",
## preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5,
## freqCut = 95/5, uniqueCut = 10, cutoff = 0.9), sampling = NULL,
## index = NULL, indexOut = NULL, indexFinal = NULL, timingSamps = 0,
## predictionBounds = rep(FALSE, 2), seeds = NA, adaptive = list(min = 5,
## alpha = 0.05, method = "gls", complete = TRUE), trim = FALSE,
## allowParallel = TRUE)
## NULL
trainControl creates an object that sets many options for how the model will be applied to the training data

* method="boot" = controls the resampling method
  * "boot" = bootstrapping (drawing with replacement)
  * "boot632" = bootstrapping with adjustment
  * "cv" = cross validation
  * "repeatedcv" = repeated cross validation
  * "LOOCV" = leave one out cross validation
* number=ifelse(grepl("cv", method), 10, 25) = number of subsamples to take
  * number=10 = default for any kind of cross validation
  * number=25 = default for bootstrapping
  * number should be increased when fine-tuning a model with a large number of parameters
* repeats=ifelse(grepl("[d_]cv$", method), 1, NA) = number of times to repeat the subsampling
  * repeats=1 = default for any cross validation method
* p=0.75 = default percentage of data used to create training sets
* initialWindow=NULL, horizon=1, fixedWindow=TRUE = parameters for time series data
* verboseIter=FALSE = print the training logs
* returnData=TRUE, returnResamp="final"
* savePredictions=FALSE = save the predictions for each resample
* classProbs=FALSE = return classification probabilities along with the predictions
* summaryFunction=defaultSummary = default summary of the model
* preProcOptions=list(thresh = 0.95, ICAcomp = 3, k = 5) = specifies preprocessing options for the model
* predictionBounds=rep(FALSE, 2) = specify the range of the predicted values
  * predictionBounds=c(10, NA) would mean that any value lower than 10 is treated as 10, with no upper bound
* seeds=NA = set the seed for the operation; important for reproducibility when the train function is run
* allowParallel=TRUE = allows parallel processing/computations

featurePlot(x=predictors, y=outcome, plot="pairs") = shortcut to plot the relationships between the predictors and outcomes

# load relevant libraries
library(ISLR); library(ggplot2); library(caret)
# load wage data
data(Wage)
# create training and test sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
# plot relationships between the predictors and outcome
featurePlot(x=training[,c("age","education","jobclass")], y = training$wage,plot="pairs")

qplot(age, wage, color=education, data=training) = can be used to separate data by a factor variable (by coloring the points differently)

* geom_smooth(method = "lm") = adds a regression line to the plots
* geom=c("boxplot", "jitter") = specifies what kind of plot to produce, in this case both the boxplot and the point cloud

# qplot plus linear regression lines
qplot(age,wage,colour=education,data=training)+geom_smooth(method='lm',formula=y~x)

cut2(variable, g=3) = creates a new factor variable by cutting the specified variable into n groups (3 in this case) based on percentiles

* the cut2 function is part of the Hmisc package, so library(Hmisc) must be run first

grid.arrange(p1, p2, ncol=2) = ggplot2-compatible function to print multiple graphs on the same plot

* the grid.arrange function is part of the gridExtra package, so library(gridExtra) must be run first

# load Hmisc and gridExtra packages
library(Hmisc);library(gridExtra);
# cut the wage variable
cutWage <- cut2(training$wage,g=3)
table(cutWage)
## cutWage
## [ 20.1, 92.7) [ 92.7,119.1) [119.1,318.3]
## 701 723 678
# plot the boxplot
p1 <- qplot(cutWage,age, data=training,fill=cutWage,
geom=c("boxplot"))
# plot boxplot and point clusters
p2 <- qplot(cutWage,age, data=training,fill=cutWage,
geom=c("boxplot","jitter"))
# plot the two graphs side by side
grid.arrange(p1,p2,ncol=2)

table(cutVariable, data$var2) = tabulates the cut factor variable vs another variable in the dataset (i.e. builds a contingency table using cross-classifying factors)

prop.table(table, margin=1) = converts a table to a proportion table

* margin=1 = calculate the proportions based on the rows
* margin=2 = calculate the proportions based on the columns

# tabulate the cutWage and jobclass variables
t <- table(cutWage,training$jobclass)
# print table
t
##
## cutWage 1. Industrial 2. Information
## [ 20.1, 92.7) 445 256
## [ 92.7,119.1) 376 347
## [119.1,318.3] 269 409
# convert to proportion table based on the rows
prop.table(t,1)
##
## cutWage 1. Industrial 2. Information
## [ 20.1, 92.7) 0.6348074 0.3651926
## [ 92.7,119.1) 0.5200553 0.4799447
## [119.1,318.3] 0.3967552 0.6032448
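cut2(x, g=3) is essentially cut() at quantile breakpoints. A base-R sketch of the same cut-then-tabulate workflow on synthetic data (no Hmisc required; the variable names here are made up for illustration):

```r
set.seed(42)
x   <- rnorm(300, mean = 100, sd = 20)            # synthetic "wage"
grp <- sample(c("A", "B"), 300, replace = TRUE)   # synthetic factor

# cut at tertiles, mirroring cut2(x, g=3)
xCut <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 4)),
            include.lowest = TRUE)

t <- table(xCut, grp)
prop.table(t, margin = 1)   # each row of proportions sums to 1
```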
qplot(var1, color=var2, data=training, geom="density") = produces density plot for the given numeric and factor variables
# produce density plot
qplot(wage,colour=education,data=training,geom="density")

Note: the test set must always be standardized with the mean and standard deviation of the training variables.

train(y~x, data=training, preProcess=c("center", "scale")) = preprocessing can be directly specified in the train function

* preProcess=c("center", "scale") = normalize all predictors before constructing the model

preProcess(trainingData, method=c("center", "scale")) = function in caret to standardize data

* store the result of the preProcess function as an object and apply it to the train and test sets using the predict function

# load spam data
library(caret); library(kernlab); data(spam)
# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create a histogram
hist(training$capitalAve,main="",xlab="ave. capital run length")

# mean and sd
mean(training$capitalAve)
## [1] 4.688791
sd(training$capitalAve)
## [1] 26.64678
# create preProcess object for all predictors ("-58" because 58th = outcome)
preObj <- preProcess(training[,-58],method=c("center","scale"))
# normalize training set
trainCapAveS <- predict(preObj,training[,-58])$capitalAve
# normalize test set using training parameters
testCapAveS <- predict(preObj,testing[,-58])$capitalAve
# compare results for capitalAve variable
rbind(train = c(mean = mean(trainCapAveS), std = sd(trainCapAveS)),
      test = c(mean(testCapAveS), sd(testCapAveS)))
## mean std
## train 6.097035e-18 1.000000
## test 7.548133e-02 1.633866
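The key point above is that the test set is standardized with the training mean and standard deviation, which is why its mean and sd come out only approximately 0 and 1. A self-contained base-R illustration with toy numbers:

```r
train_x <- c(5, 7, 9, 11, 13)
test_x  <- c(6, 10, 14)

mu    <- mean(train_x)   # 9
sigma <- sd(train_x)     # sqrt(10)

train_s <- (train_x - mu) / sigma
test_s  <- (test_x - mu) / sigma   # training parameters, not the test set's own

c(mean(train_s), sd(train_s))   # exactly 0 and 1 on the training set
c(mean(test_s),  sd(test_s))    # only roughly 0 and 1 on the test set
```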
preProcess(data, method="BoxCox") = applies Box-Cox transformations to continuous data to help normalize the variables through maximum likelihood

qqnorm(processedVar) = produces a Q-Q plot, which compares the theoretical quantiles with the sample quantiles to check the normality of the data

# set up BoxCox transforms
preObj <- preProcess(training[,-58],method=c("BoxCox"))
# perform preprocessing on training data
trainCapAveS <- predict(preObj,training[,-58])$capitalAve
# plot histogram and QQ Plot
# Note: the transformation definitely helped to
# normalize the data but it does not produce perfect result
par(mfrow=c(1,2)); hist(trainCapAveS); qqnorm(trainCapAveS)

preProcess(data, method="knnImpute") = impute/estimate the missing data using k nearest neighbors (knn) imputation

* knnImpute = takes the k nearest neighbors of the missing value and averages their values to impute the missing observations

# Make some values NA
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1],size=1,prob=0.05)==1
training$capAve[selectNA] <- NA
# Impute and standardize
preObj <- preProcess(training[,-58],method="knnImpute")
capAve <- predict(preObj,training[,-58])$capAve
# Standardize true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth-mean(capAveTruth))/sd(capAveTruth)
# compute differences between imputed values and true values
# to see how close those two values are to each other
quantile(capAve - capAveTruth)
## 0% 25% 50% 75% 100%
## -1.656344e+00 2.377772e-05 1.286900e-03 1.880821e-03 3.174413e-01
quantile((capAve - capAveTruth)[selectNA])
## 0% 25% 50% 75% 100%
## -1.656344036 -0.009489595 0.001683679 0.015559636 0.317441284
quantile((capAve - capAveTruth)[!selectNA])
## 0% 25% 50% 75% 100%
## -8.465254e-01 9.822581e-05 1.284404e-03 1.850043e-03 2.349137e-03
K-nearest neighbors imputation finds the k data vectors that look most like the data vector with the missing value (for example, the 10 nearest if k equals 10), averages the values of the missing variable across those neighbors, and imputes the result at that position.

preProcess() can also be leveraged to create new covariates

dummyVars(outcome~var, data=training) = creates a dummy variable object that can be used through the predict function to create dummy variables

* predict(dummyObj, newdata=training) = creates the appropriate columns to represent the factor variable with appropriate 0s and 1s
library(ISLR); library(caret); data(Wage);
# setting up data
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# create a dummy variable object
dummies <- dummyVars(wage ~ jobclass,data=training)
# create the dummy variable columns
head(predict(dummies,newdata=training))
## jobclass.1. Industrial jobclass.2. Information
## 231655 1 0
## 86582 0 1
## 161300 1 0
## 155159 0 1
## 11443 0 1
## 376662 0 1
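Base R's model.matrix produces the same 0/1 expansion as dummyVars; a small self-contained sketch (factor levels are made up for illustration):

```r
# one 0/1 column per factor level; "- 1" keeps a column for every level
# instead of dropping a reference level, matching dummyVars behavior
jobclass <- factor(c("Industrial", "Information", "Industrial"))
mm <- model.matrix(~ jobclass - 1)
mm   # each row has exactly one 1, marking that observation's level
```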
nearZeroVar(training, saveMetrics=TRUE) = returns a table of the variables in the training data set with information on frequency ratios, percent unique values, and whether or not each variable has zero or near zero variance

* freqRatio = ratio of the frequency of the most common value over the second most common value
* percentUnique = percentage of unique data points out of the total number of data points
* zeroVar = TRUE/FALSE indicating whether the predictor has only one distinct value
* nzv = TRUE/FALSE indicating whether the predictor is a near zero variance predictor
  * when nzv = TRUE, those variables should be thrown out

# print nearZeroVar table
nearZeroVar(training,saveMetrics=TRUE)
## freqRatio percentUnique zeroVar nzv
## year 1.017647 0.33301618 FALSE FALSE
## age 1.231884 2.85442436 FALSE FALSE
## maritl 3.329571 0.23786870 FALSE FALSE
## race 8.480583 0.19029496 FALSE FALSE
## education 1.393750 0.23786870 FALSE FALSE
## region 0.000000 0.04757374 TRUE TRUE
## jobclass 1.070936 0.09514748 FALSE FALSE
## health 2.526846 0.09514748 FALSE FALSE
## health_ins 2.209160 0.09514748 FALSE FALSE
## logwage 1.011765 18.83920076 FALSE FALSE
## wage 1.011765 18.83920076 FALSE FALSE
[splines package] bs(data$var, df=3) = creates 3 new columns corresponding to the var, var^2, and var^3 terms

* ns() and poly() can also be used to generate polynomials
* the gam() method in the caret package can also be used, and it allows smoothing of multiple variables with different values for each variable
* the same basis must be applied to the test data through the predict function

# load splines package
library(splines)
# create polynomial function
bsBasis <- bs(training$age,df=3)
head(bsBasis) # (values are scaled for computational purposes)
## 1 2 3
## [1,] 0.0000000 0.00000000 0.000000000
## [2,] 0.2368501 0.02537679 0.000906314
## [3,] 0.4163380 0.32117502 0.082587862
## [4,] 0.4308138 0.29109043 0.065560908
## [5,] 0.3625256 0.38669397 0.137491189
## [6,] 0.3063341 0.42415495 0.195763821
# fit the outcome on the three polynomial terms
lm1 <- lm(wage ~ bsBasis,data=training)
# plot all age vs wage data
plot(training$age,training$wage,pch=19,cex=0.5)
# plot the fitted polynomial function
points(training$age,predict(lm1,newdata=training),col="red",pch=19,cex=0.5)

# predict function on test values
head(predict(bsBasis,age=testing$age))
## 1 2 3
## [1,] 0.0000000 0.00000000 0.000000000
## [2,] 0.2368501 0.02537679 0.000906314
## [3,] 0.4163380 0.32117502 0.082587862
## [4,] 0.4308138 0.29109043 0.065560908
## [5,] 0.3625256 0.38669397 0.137491189
## [6,] 0.3063341 0.42415495 0.195763821
prcomp function: pr <- prcomp(data) = performs PCA on all variables and returns a prcomp object that contains information about standard deviations and rotations

* pr$rotation = returns the eigenvectors for the linear combinations of all variables (coefficients that variables are multiplied by to come up with the principal components) \(\rightarrow\) how the principal components are created
* for skewed data, it helps to log transform the variables (adding 1 to avoid taking the log of zero) before performing PCA
* plot(pr) = plots the percent variation explained by the first 10 principal components (PC)
# load spam data
library(caret); library(kernlab); data(spam)
# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# correlated predictors
M <- abs(cor(training[,-58]))
diag(M) <- 0
which(M > 0.8, arr.ind=T)
## row col
## num415 34 32
## direct 40 32
## num857 32 34
## direct 40 34
## num857 32 40
## num415 34 40
# plot for example num857 and num415
# frequencies of the two variables are highly correlated
plot(spam[,34],spam[,32])

# we could rotate the plot
# X = 0.71*num415 + 0.71*num857
# Y = 0.71*num415 - 0.71*num857
X <- 0.71*training$num415 + 0.71*training$num857
Y <- 0.71*training$num415 - 0.71*training$num857
# perform PCA on num415 and num857
smallSpam <- spam[,c(34,32)]
prComp <- prcomp(smallSpam)
# plot rotation and first two PCs
par(mfrow = c(1,2))
plot(X,Y) # most of the variability is happening on the x-axis, clustered on zero
plot(prComp$x[,1],prComp$x[,2])
prComp$rotation
## PC1 PC2
## num415 0.7080625 0.7061498
## num857 0.7061498 -0.7080625
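The same two-variable rotation can be reproduced on synthetic data, which makes it easy to check that the first principal component absorbs almost all of the variability when two variables are highly correlated (as with num415/num857 above):

```r
set.seed(1)
a <- rnorm(100)
b <- a + rnorm(100, sd = 0.1)   # b tracks a closely
pc <- prcomp(cbind(a, b))

pc$rotation                     # 2 x 2 loadings, entries close to +/- 0.71
pc$sdev[1]^2 / sum(pc$sdev^2)   # share of variance on PC1 (close to 1)
```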
# perform PCA on dataset
# use log and add 1 helps to reduce skewness or strange distribution in data
prComp <- prcomp(log10(spam[,-58]+1))
# print out the eigenvector/rotations first 5 rows and PCs
# eg. PC1 is made up by a linear combination of the variables (column 1)
head(prComp$rotation[, 1:5], 5)
## PC1 PC2 PC3 PC4 PC5
## make 0.019370409 0.0427855959 -0.01631961 0.02798232 -0.014903314
## address 0.010827343 0.0408943785 0.07074906 -0.01407049 0.037237531
## all 0.040923168 0.0825569578 -0.03603222 0.04563653 0.001222215
## num3d 0.006486834 -0.0001333549 0.01234374 -0.01005991 -0.001282330
## our 0.036963221 0.0941456085 -0.01871090 0.05098463 -0.010582039
# create new variable that marks spam as 2 and nospam as 1
typeColor <- ((spam$type=="spam")*1 + 1)
# plot the first two principal components
par(mfrow = c(1,1))
plot(prComp$x[,1],prComp$x[,2],col=typeColor,xlab="PC1",ylab="PC2")

PCA with the caret package: pp <- preProcess(log10(training[,-58]+1), method="pca", pcaComp=2, thresh=0.8) = performs PCA with the preProcess function and returns the number of principal components that can capture the majority of the variation

* the result is a preProcess object that can be applied using the predict function
* pcaComp=2 = specifies the number of principal components to compute (2 in this case)
* thresh=0.8 = threshold for variation captured by the principal components
  * thresh=0.95 = default value, which returns the number of principal components needed to capture 95% of the variation in the data
* predict(pp, training) = computes the new PC variables (2 in this case) for the training data set
  * the result of predict can then be used as data for the prediction model

# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create preprocess object
preProc <- preProcess(log10(training[,-58]+1),method="pca",pcaComp=2)
# calculate PCs for training data
trainPC <- predict(preProc,log10(training[,-58]+1))
# plot the two principal components
typeColor <- ((spam$type=="spam")*1 + 1)
plot(trainPC[,1],trainPC[,2],col=typeColor)

# run model on outcome and principal components
# OLDER VERSION: modelFit <- train(training$type ~ .,method="glm",data=trainPC)
modelFit <- train(x = trainPC, y = training$type, method="glm")
# calculate PCs for test data
testPC <- predict(preProc,log10(testing[,-58]+1))
# compare results
confusionMatrix(testing$type,predict(modelFit,testPC))
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 650 47
## spam 74 379
##
## Accuracy : 0.8948
## 95% CI : (0.8756, 0.9119)
## No Information Rate : 0.6296
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7773
## Mcnemar's Test P-Value : 0.0181
##
## Sensitivity : 0.8978
## Specificity : 0.8897
## Pos Pred Value : 0.9326
## Neg Pred Value : 0.8366
## Prevalence : 0.6296
## Detection Rate : 0.5652
## Detection Prevalence : 0.6061
## Balanced Accuracy : 0.8937
##
## 'Positive' Class : nonspam
##
Alternatively, PCA preprocessing can be specified directly in the train method:

train(outcome ~ ., method="glm", preProcess="pca", data=training) = performs PCA first on the training set and then runs the specified model (equivalent to the preProcess \(\rightarrow\) predict approach above)

# construct model
modelFit <- train(type ~ ., method="glm", preProcess="pca", data=training)
# print results of model
confusionMatrix(testing$type,predict(modelFit,testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 661 36
## spam 55 398
##
## Accuracy : 0.9209
## 95% CI : (0.9037, 0.9358)
## No Information Rate : 0.6226
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8331
## Mcnemar's Test P-Value : 0.05917
##
## Sensitivity : 0.9232
## Specificity : 0.9171
## Pos Pred Value : 0.9484
## Neg Pred Value : 0.8786
## Prevalence : 0.6226
## Detection Rate : 0.5748
## Detection Prevalence : 0.6061
## Balanced Accuracy : 0.9201
##
## 'Positive' Class : nonspam
##
lm <- lm(y ~ x, data=train) = runs a linear model of outcome y on predictor x \(\rightarrow\) univariate regression

* summary(lm) = returns a summary of the linear regression model, including coefficients, standard errors, \(t\) statistics, and p values
* lm(y ~ x1+x2+x3, data=train) = runs a linear model of outcome y on predictors x1, x2, and x3
* lm(y ~ ., data=train) = runs a linear model of outcome y on all predictors
* predict(lm, newdata=df) = uses the constructed linear model to predict outcomes (\(\hat Y_i\)) for the new values
  * the newdata data frame must have the same variables (factors must have the same levels) as the training data
  * newdata=test = predict outcomes for the test set based on the linear regression model from the training set

# load data
library(caret); data(faithful); set.seed(333)
# create train and test sets
inTrain <- createDataPartition(y=faithful$waiting, p=0.5, list=FALSE)
trainFaith <- faithful[inTrain,]; testFaith <- faithful[-inTrain,]
# explore data
head(trainFaith)
## eruptions waiting
## 1 3.600 79
## 3 3.333 74
## 5 4.533 85
## 6 2.883 55
## 7 4.700 88
## 8 3.600 85
plot(trainFaith$waiting, trainFaith$eruptions, pch=19, col="blue", xlab="Waiting", ylab="Duration")

# build linear model
lm1 <- lm(eruptions ~ waiting,data=trainFaith)
# print summary of linear model
summary(lm1)
##
## Call:
## lm(formula = eruptions ~ waiting, data = trainFaith)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26990 -0.34789 0.03979 0.36589 1.05020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.792739 0.227869 -7.867 1.04e-12 ***
## waiting 0.073901 0.003148 23.474 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.495 on 135 degrees of freedom
## Multiple R-squared: 0.8032, Adjusted R-squared: 0.8018
## F-statistic: 551 on 1 and 135 DF, p-value: < 2.2e-16
# predict eruptions for new waiting time
newdata <- data.frame(waiting=80)
predict(lm1,newdata)
## 1
## 4.119307
# create 1 x 2 panel plot
par(mfrow=c(1,2))
# plot train data with the regression line
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",
ylab="Duration", main = "Train")
lines(trainFaith$waiting,predict(lm1),lwd=3)
# plot test data with the regression line
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue",xlab="Waiting",
ylab="Duration", main = "Test")
lines(testFaith$waiting,predict(lm1,newdata=testFaith),lwd=3)

# Calculate RMSE on training and test sets
anova.table <-anova(lm1)
trainRMSE.anova <- sqrt(anova.table$"Mean Sq"[2])
c(trainRMSE.anova = trainRMSE.anova,
trainRMSE = sqrt(sum((lm1$fitted-trainFaith$eruptions)^2/(nrow(trainFaith) - 2))),
  testRMSE = sqrt(sum((predict(lm1,newdata=testFaith)-testFaith$eruptions)^2/(nrow(testFaith) - 2))))
## trainRMSE.anova trainRMSE testRMSE
## 0.4950413 0.4950413 0.5062673
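RMSE is worth wrapping in a helper. Note the code above divides by n - 2 (residual degrees of freedom, matching anova); the plain definition divides by n. A base-R sketch of the plain version:

```r
# plain RMSE: square root of the mean squared prediction error (divides by n)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(actual = c(3, 5, 7), predicted = c(2, 5, 9))
## [1] 1.290994
```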
pi <- predict(lm, newdata=test, interval="prediction") = returns 3 columns: fit (predicted value, same as before), lwr (lower bound of the prediction interval), and upr (upper bound of the prediction interval)

matlines(x, pi, type="l") = plots three lines, one for the linear fit and two for the upper/lower prediction interval bounds

# calculate prediction intervals
pred1 <- predict(lm1,newdata=testFaith,interval="prediction")
# plot data points (eruptions, waiting)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue")
# plot fit line and prediction interval
matlines(testFaith$waiting,pred1,type="l",col=c(1,2,2),lty=c(1,1,1),lwd=3)

lm <- train(y ~ x, method="lm", data=train) = runs a linear model on the training data \(\rightarrow\) identical to the lm function

* summary(lm$finalModel) = returns a summary of the linear regression model, including coefficients, standard errors, \(t\) statistics, and p values \(\rightarrow\) identical to summary(lm) for an lm object
* train(y ~ ., method="lm", data=train) = runs a linear model on all predictors in the training data
* plot(lm$finalModel) = constructs 4 diagnostic plots for evaluating the model (see ?plot.lm)

# same process with caret package
modFit <- train(eruptions ~ waiting, data=trainFaith, method="lm")
summary(modFit$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26990 -0.34789 0.03979 0.36589 1.05020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.792739 0.227869 -7.867 1.04e-12 ***
## waiting 0.073901 0.003148 23.474 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.495 on 135 degrees of freedom
## Multiple R-squared: 0.8032, Adjusted R-squared: 0.8018
## F-statistic: 551 on 1 and 135 DF, p-value: < 2.2e-16
# load data
library(ISLR); library(ggplot2); library(caret);
data(Wage)
# create train and test sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# fit linear model for age jobclass and education
modFit<- train(wage ~ age + jobclass + education,method = "lm",data=training)
# store final model
finMod <- modFit$finalModel
# summary of modFit
print(modFit)
## Linear Regression
##
## 2102 samples
## 3 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 35.79066 0.248127 24.68782
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# set up 2 x 2 panel plot
par(mfrow = c(2, 2))
# construct diagnostic plots for model
plot(finMod,pch=19,cex=0.5,col="#00000010")
# plot fitted values vs residuals, colored by race
qplot(finMod$fitted, finMod$residuals, color=race, data=training)
plot(finMod$residuals) = plot the residuals against index (row number)
# plot residuals by index
plot(finMod$residuals,pch=19,cex=0.5)
# predict on the test set and plot predicted vs actual wage, colored by year
pred <- predict(modFit, testing)
qplot(wage, pred, colour=year, data=testing)
Decision trees can be constructed both within the caret package (party, rpart) and outside of it (tree). Within each group/leaf \(m\), the estimated probability of class \(k\) is \[\hat{p}_{mk} = \frac{1}{N_m}\sum_{i \in m} \mathbb{1}(y_i = k)\]
where \(N_m\) is the size of the group
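The node probability above is simply a class frequency within the group; a quick base-R sketch (toy labels, hypothetical data):

```r
# toy labels for the observations falling in one node m (hypothetical data)
node.labels <- c("spam", "spam", "nonspam", "spam")
# p.hat(labels, k) = fraction of the node's N_m observations with class k
p.hat <- function(labels, k) sum(labels == k) / length(labels)
p.hat(node.labels, "spam")   # 0.75
```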
Example
# set margin and seed
par(mar=c(1,1,1,1), mfrow = c(1, 2)); set.seed(1234);
# simulate data
x = rep(1:4,each=4); y = rep(1:4,4)
# plot first scenario
plot(x,y,xaxt="n",yaxt="n",cex=3,col=c(rep("blue",15),rep("red",1)),pch=19)
# plot second scenario
plot(x,y,xaxt="n",yaxt="n",cex=3,col=c(rep("blue",8),rep("red",8)),pch=19)
caret Package
tree <- train(y ~ ., data=train, method="rpart") = constructs trees based on the outcome and predictors
    the resulting model is an rpart object, which can be used to predict new/test values
print(tree$finalModel) = returns a text summary of all nodes/splits in the constructed tree
plot(tree$finalModel, uniform=TRUE) = plots the classification tree with all nodes/splits
[rattle package] fancyRpartPlot(tree$finalModel) = produces more readable, better formatted classification tree diagrams
# load iris data set
data(iris)
names(iris)## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
# want to predict species
table(iris$Species)##
## setosa versicolor virginica
## 50 50 50
# create test/train data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# plot petal widths vs sepal widths
qplot(Petal.Width, Sepal.Width, colour=Species, data=training)
# fit classification tree as a model
modFit <- train(Species ~ .,method="rpart",data=training)
# print the classification tree
print(modFit$finalModel)## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)
## 2) Petal.Length< 2.45 35 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)
## 6) Petal.Width< 1.65 34 1 versicolor (0.00000000 0.97058824 0.02941176) *
## 7) Petal.Width>=1.65 36 2 virginica (0.00000000 0.05555556 0.94444444) *
# simple dendrogram
plot(modFit$finalModel, uniform=TRUE, main="Classification Tree")
text(modFit$finalModel, use.n=TRUE, all=TRUE, cex=0.8)
# plot a prettier classification tree
rattle::fancyRpartPlot(modFit$finalModel)
# predict on test values
predict(modFit,newdata=testing)## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica virginica virginica virginica virginica virginica
## [37] versicolor virginica virginica versicolor versicolor virginica
## [43] virginica virginica virginica
## Levels: setosa versicolor virginica
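The predict call above returns hard class labels; the fitted caret model can also return per-class probabilities via `type="prob"`. A self-contained sketch (the seed and partition here are illustrative, not from the original example):

```r
library(caret)
data(iris); set.seed(125)
inTrain <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
modFit <- train(Species ~ ., method = "rpart", data = iris[inTrain, ])
# per-class probabilities (one column per species) instead of hard labels
head(predict(modFit, newdata = iris[-inTrain, ], type = "prob"))
```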
Bagging is often used with trees (an extension is random forests).
loess(y ~ x, data=train, span=0.2) = fits a smooth curve to the data
    span=0.2 = controls how smooth the curve should be
# load data
library(ElemStatLearn); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# want to predict temperature
head(ozone)## ozone radiation temperature wind
## 17 1 8 59 9.7
## 19 4 25 61 9.7
## 14 6 78 57 18.4
## 45 7 48 80 14.3
## 106 7 49 69 10.3
## 7 8 19 61 20.1
# create empty matrix
ll <- matrix(NA,nrow=10,ncol=155)
# iterate 10 times
for(i in 1:10){
# resample with replacement
ss <- sample(1:dim(ozone)[1],replace=T)
# draw sample from the data and reorder rows based on ozone
ozone0 <- ozone[ss,]; ozone0 <- ozone0[order(ozone0$ozone),]
# fit loess function through data (similar to spline)
loess0 <- loess(temperature ~ ozone,data=ozone0,span=0.2)
# predict for every single loess curve the outcome for ozone values 1 to 155
ll[i,] <- predict(loess0,newdata=data.frame(ozone=1:155))
}
# plot the data points
plot(ozone$ozone,ozone$temperature,pch=19,cex=0.5)
# plot each prediction model
for(i in 1:10){lines(1:155,ll[i,],col="grey",lwd=2)}
# plot the average in red
lines(1:155,apply(ll,2,mean),col="red",lwd=2)
In the caret package, there are three options for the train function to perform bagging:
    bagEarth - Bagged MARS (documentation)
    treebag - Bagged CART (documentation)
    bagFDA - Bagged Flexible Discriminant Analysis (documentation)
Alternatively, custom bag functions can be constructed (documentation)
bag(predictors, outcome, B=10, bagControl(fit, predict, aggregate)) = define and execute a custom bagging algorithm
    B=10 = number of iterations/resamplings to perform
    bagControl() = controls for how the bagging should be executed
        fit=ctreeBag$fit = the function that is applied to fit the model each time (could be a call to the train function in the caret package)
        predict=ctreeBag$predict = how, given a particular model fit, to predict new values (could be a call to the predict function from a trained model)
        aggregate=ctreeBag$aggregate = how to combine the predictions (e.g. average the predictions across all the different replicated samples)
# load relevant package and data
library(party); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# extract predictors
predictors <- data.frame(ozone=ozone$ozone)
# extract outcome
temperature <- ozone$temperature
# run bagging algorithm
treebag <- bag(predictors, temperature, B = 10,
# custom bagging function
bagControl = bagControl(fit = ctreeBag$fit,
predict = ctreeBag$pred,
aggregate = ctreeBag$aggregate))
# plot data points
plot(ozone$ozone,temperature,col='lightgrey',pch=19)
# plot the first fit
points(ozone$ozone,predict(treebag$fits[[1]]$fit,predictors),pch=19,col="red")
# plot the aggregated predictions
points(ozone$ozone,predict(treebag,predictors),pch=19,col="blue")
# look at the different parts of bagging
ctreeBag$fit## function (x, y, ...)
## {
## loadNamespace("party")
## data <- as.data.frame(x)
## data$y <- y
## party::ctree(y ~ ., data = data)
## }
## <environment: namespace:caret>
ctreeBag$pred## function (object, x)
## {
## if (!is.data.frame(x))
## x <- as.data.frame(x)
## obsLevels <- levels(object@data@get("response")[, 1])
## if (!is.null(obsLevels)) {
## rawProbs <- party::treeresponse(object, x)
## probMatrix <- matrix(unlist(rawProbs), ncol = length(obsLevels),
## byrow = TRUE)
## out <- data.frame(probMatrix)
## colnames(out) <- obsLevels
## rownames(out) <- NULL
## }
## else out <- unlist(party::treeresponse(object, x))
## out
## }
## <environment: namespace:caret>
ctreeBag$aggregate## function (x, type = "class")
## {
## if (is.matrix(x[[1]]) | is.data.frame(x[[1]])) {
## pooled <- x[[1]] & NA
## classes <- colnames(pooled)
## for (i in 1:ncol(pooled)) {
## tmp <- lapply(x, function(y, col) y[, col], col = i)
## tmp <- do.call("rbind", tmp)
## pooled[, i] <- apply(tmp, 2, median)
## }
## if (type == "class") {
## out <- factor(classes[apply(pooled, 1, which.max)],
## levels = classes)
## }
## else out <- as.data.frame(pooled)
## }
## else {
## x <- matrix(unlist(x), ncol = length(x))
## out <- apply(x, 1, median)
## }
## out
## }
## <environment: namespace:caret>
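The same fit/predict/aggregate pattern can be sketched in base R without caret. This toy example (simulated data; loess as the base learner is an assumption for illustration) bootstraps the data, fits a model to each resample, and averages the predictions:

```r
set.seed(42)
# simulated data: a noisy sine curve
x <- seq(0, 2 * pi, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)
B <- 25                                    # number of bootstrap resamples
preds <- matrix(NA, nrow = B, ncol = length(x))
for (b in 1:B) {
  idx <- sample(length(x), replace = TRUE)   # resample with replacement
  # "fit" step: model the bootstrap sample
  fit <- loess(y ~ x, data = data.frame(x = x[idx], y = y[idx]), span = 0.5)
  # "predict" step: predict at the original x values
  preds[b, ] <- predict(fit, newdata = data.frame(x = x))
}
# "aggregate" step: average the B sets of predictions
bagged <- colMeans(preds, na.rm = TRUE)
```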
ctreeBag$fit uses the ctree function to train a conditional inference tree
ctreeBag$pred takes the object from the ctree model fit and a new data set x to get a new prediction
ctreeBag$aggregate takes those predicted values and averages them together (or combines them in some other way)
Random forests (cross-validation can be performed with the rfcv function):
rf <- train(outcome ~ ., data=train, method="rf", prox=TRUE, ntree=500) = runs the random forest algorithm on the training data against all predictors
    prox=TRUE = the proximity measures between observations should be calculated (used in functions such as classCenter() to find the center of groups)
        rf$finalModel$prox = returns the matrix of proximities
    ntree=500 = specify the number of trees that should be constructed
    do.trace=TRUE = prints logs as the trees are being built \(\rightarrow\) useful for indicating progress to the user
the randomForest() function can be used to perform the random forest algorithm directly (syntax is the same as train) and is much faster
getTree(rf$finalModel, k=2) = returns a specific tree from the random forest model
classCenter(predictors, outcome, proximity, nNbr) = computes the cluster centers using the nNbr nearest neighbors of the observations
    prox = rf$finalModel$prox = proximity matrix from the random forest model
    nNbr = number of nearest neighbors that should be used to compute cluster centers
predict(rf, test) = apply the random forest model to the test data set
confusionMatrix(predictions, actualOutcome) = tabulates the predictions of the model against the truths
# load data
library(caret)
data(iris)
# create train/test data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# apply random forest
modFit <- train(Species~ .,data=training,method="rf",prox=TRUE)
modFit## Random Forest
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9481790 0.9206752
## 3 0.9501546 0.9236665
## 4 0.9513311 0.9254397
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
# better to use the randomForest function directly (faster)
library(randomForest)
modelRf <- randomForest(Species ~ ., data = training, importance = FALSE)
# return the second tree (first 6 rows)
# each of these rows corresponds to a particular split
head(getTree(modFit$finalModel,k=2))## left daughter right daughter split var split point status prediction
## 1 2 3 4 0.70 1 0
## 2 0 0 0 0.00 -1 1
## 3 4 5 3 4.75 1 0
## 4 6 7 4 1.65 1 0
## 5 8 9 4 1.70 1 0
## 6 0 0 0 0.00 -1 2
# compute cluster centers
irisP <- classCenter(training[,c(3,4)], training$Species, modFit$finalModel$prox)
# convert irisP to data frame and add Species column
irisP <- as.data.frame(irisP); irisP$Species <- rownames(irisP)
# plot data points
p <- qplot(Petal.Width, Petal.Length, col=Species,data=training)
# add the cluster centers
p + geom_point(aes(x=Petal.Width,y=Petal.Length,col=Species),size=5,shape=4,data=irisP)
# predict outcome for test data set using the random forest model
pred <- predict(modFit,testing)
# logic value for whether or not the rf algorithm predicted correctly
testing$predRight <- pred==testing$Species
# tabulate results
table(pred,testing$Species)##
## pred setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 12 0
## virginica 0 3 15
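The table above can equivalently be produced with confusionMatrix(), mentioned earlier, which also reports overall accuracy, kappa, and per-class sensitivity/specificity. A self-contained sketch (the seed and the 3-fold cross-validation control are illustrative choices to keep the run fast; exact counts will vary with the partition):

```r
library(caret)
data(iris); set.seed(32343)
inTrain <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]; testing <- iris[-inTrain, ]
# fit a random forest with a light resampling scheme for speed
modFit <- train(Species ~ ., data = training, method = "rf",
                trControl = trainControl(method = "cv", number = 3))
# tabulate predictions vs truth, plus accuracy/kappa and per-class statistics
confusionMatrix(predict(modFit, testing), testing$Species)
```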
# plot data points with the incorrect classification highlighted
qplot(Petal.Width,Petal.Length,colour=predRight,data=testing,main="newdata Predictions")
a technical tutorial on boosting can be found here
Example
gbm <- train(outcome ~ variables, method="gbm", data=train, verbose=F) = run boosting model on the given data
The available methods of boosting are:
    gbm - boosting with trees
    mboost - model based boosting
    ada - statistical boosting based on additive logistic regression
    gamBoost - boosting generalized additive models
the predict function can be used to apply the model to test data, similar to the rest of the algorithms in the caret package
Example
# load data
library(ISLR); library(caret); library(ggplot2)
data(Wage)
# remove log wage variable (we are trying to predict wage)
Wage <- subset(Wage,select=-c(logwage))
# create train/test data sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# run the gbm model
modFit <- train(wage ~ ., method="gbm",data=training,verbose=FALSE)
# print model summary
print(modFit)## Stochastic Gradient Boosting
##
## 2102 samples
## 9 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 35.40267 0.3268457 24.04840
## 1 100 34.89519 0.3332554 23.77177
## 1 150 34.82822 0.3343683 23.80400
## 2 50 34.86088 0.3363506 23.67719
## 2 100 34.72696 0.3385710 23.66990
## 2 150 34.75429 0.3373207 23.72571
## 3 50 34.73929 0.3384432 23.63118
## 3 100 34.78334 0.3364309 23.74651
## 3 150 34.88069 0.3334293 23.86710
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
# plot the results (not bad, but still some variability)
qplot(predict(modFit, testing), wage, data=testing)
lda <- train(outcome ~ predictors, data=training, method="lda") = constructs a linear discriminant analysis model on the predictors with the provided training data
predict(lda, test) = applies the LDA model to the test data and returns the prediction results in a data frame
Example (caret package)
# load data
data(iris)
table(iris$Species)##
## setosa versicolor virginica
## 50 50 50
# create training and test sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# run the linear discriminant analysis on training data
lda <- train(Species ~ .,data=training,method="lda")
# predict test outcomes using LDA model
pred.lda <- predict(lda,testing)
# print results
pred.lda## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa versicolor versicolor versicolor
## [19] versicolor versicolor versicolor virginica versicolor versicolor
## [25] versicolor virginica versicolor versicolor versicolor versicolor
## [31] virginica virginica virginica virginica virginica virginica
## [37] virginica virginica virginica virginica virginica virginica
## [43] virginica virginica virginica
## Levels: setosa versicolor virginica
nb <- train(outcome ~ predictors, data=training, method="nb") = constructs a naive Bayes model on the predictors with the provided training data
predict(nb, test) = applies the naive Bayes model to the test data and returns the prediction results in a data frame
Example (caret package)
# using the same data from iris, run naive Bayes on training data
nb <- train(Species ~ ., data=training,method="nb")
# predict test outcomes using naive Bayes model
pred.nb <- predict(nb,testing)
# print results
pred.nb## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa versicolor versicolor versicolor
## [19] versicolor versicolor versicolor virginica versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica virginica virginica virginica virginica virginica
## [37] virginica virginica virginica virginica virginica virginica
## [43] virginica virginica virginica
## Levels: setosa versicolor virginica
Since both models were fit on the same iris data set, we can compare the prediction results from the two models
# tabulate the prediction results from LDA and naive Bayes
table(pred.lda,pred.nb)## pred.nb
## pred.lda setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 13 0
## virginica 0 1 16
# create logical variable that returns TRUE for when predictions from the two models match
equalPredictions <- (pred.lda==pred.nb)
# plot the comparison
qplot(Petal.Width,Sepal.Width,colour=equalPredictions,data=testing)
For this example, the prostate dataset from Elements of Statistical Learning is used
library(ElemStatLearn)
# load data and set seed
data(prostate); set.seed(1)
str(prostate)## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
# define outcome y and predictors x
covnames <- names(prostate[-(9:10)])
y <- prostate$lpsa; x <- prostate[,covnames]
# create test set predictors and outcomes
train.ind <- sample(nrow(prostate), ceiling(nrow(prostate)/2))
y.test <- prostate$lpsa[-train.ind]; x.test <- x[-train.ind,]
# create training set predictors and outcomes
y <- prostate$lpsa[train.ind]; x <- x[train.ind,]
# p = number of predictors
p <- length(covnames)
# initialize the list of residual sum squares
rss <- list()
# loop through each combination of predictors and build models
for (i in 1:p) {
# compute all combinations of i predictors out of p, for i = 1...p (creates an i x choose(p,i) matrix)
Index <- combn(p,i)
# calculate residual sum squares of each combination of predictors
rss[[i]] <- apply(Index, 2, function(is) {
# take each combination (or column of Index matrix) and create formula for regression
form <- as.formula(paste("y~", paste(covnames[is], collapse="+"), sep=""))
# run linear regression with combination of predictors on training data
isfit <- lm(form, data=x)
# predict outcome for all training data points
yhat <- predict(isfit)
# calculate residual sum squares for predictions on training data
train.rss <- sum((y - yhat)^2)
# predict outcome for all test data points
yhat <- predict(isfit, newdata=x.test)
# calculate residual sum squares for predictions on test data
test.rss <- sum((y.test - yhat)^2)
# store each pair of training and test residual sum squares as a list
c(train.rss, test.rss)
})
}
# set up plot with labels, title, and proper x and y limits
plot(1:p, 1:p, type="n", ylim=range(unlist(rss)), xlim=c(0,p),
xlab="Number of Predictors", ylab="Residual Sum of Squares",
main="Prostate Cancer Data - Training vs Test RSS")
# add data points for training and test residual sum squares
for (i in 1:p) {
# plot training residual sum squares in blue
points(rep(i, ncol(rss[[i]])), rss[[i]][1, ], col="blue", cex = 0.5)
# plot test residual sum squares in red
points(rep(i, ncol(rss[[i]])), rss[[i]][2, ], col="red", cex = 0.5)
}
# find the minimum training RSS for each combination of predictors
minrss <- sapply(rss, function(x) min(x[1,]))
# plot line through the minimum training RSS data points in blue
lines((1:p), minrss, col="blue", lwd=1.7)
# find the minimum test RSS for each combination of predictors
minrss <- sapply(rss, function(x) min(x[2,]))
# plot line through the minimum test RSS data points in red
lines((1:p), minrss, col="red", lwd=1.7)
# add legend
legend("topright", c("Train", "Test"), col=c("blue", "red"), pch=1)
# load prostate data
data(prostate)
# create a small subset of 5 observations (fewer observations than model parameters)
small = prostate[1:5,]
# fit linear model
lm(lpsa ~ .,data =small)##
## Call:
## lm(formula = lpsa ~ ., data = small)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph
## 9.60615 0.13901 -0.79142 0.09516 NA
## svi lcp gleason pgg45 trainTRUE
## NA NA -2.08710 NA NA
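The NA coefficients appear because the 5-row subset has fewer observations than model parameters, so lm() cannot estimate all of them. A minimal base-R illustration of the same effect with simulated data:

```r
set.seed(1)
# 4 observations but 6 predictors (plus an intercept): the design matrix is
# rank-deficient, so lm() reports NA for the non-estimable coefficients
d <- as.data.frame(matrix(rnorm(4 * 6), nrow = 4))
d$y <- rnorm(4)
fit <- lm(y ~ ., data = d)
sum(is.na(coef(fit)))   # 3 of the 7 coefficients cannot be estimated
```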
(MASS package) ridge <- lm.ridge(outcome ~ predictors, data=training, lambda=5) = perform ridge regression with the given outcome and predictors using the provided \(\lambda\) value
    lambda=5 = tuning parameter
    ridge$xm = returns the column/predictor means from the data
    ridge$scale = returns the scaling performed on the predictors for the ridge regression
    ridge$coef = returns the coefficients \(\beta_j\) from the ridge regression
    ridge$ym = returns the mean of the outcome
(caret package) train(outcome ~ predictors, data=training, method="ridge", lambda=5) = perform ridge regression with the given outcome and predictors
    preProcess=c("center", "scale") = centers and scales the predictors before the model is built
    lambda=5 = tuning parameter
(caret package) train(outcome ~ predictors, data=training, method="foba", lambda=5, k=4) = perform ridge regression with variable selection
    lambda=5 = tuning parameter
    k=4 = number of variables that should be retained
        length(predictors)-k variables will be eliminated
(caret package) predict(model,test) = use the model to predict on the test set \(\rightarrow\) similar to all other caret algorithms
Example: using the prostate dataset, we will run ridge regressions with different values of \(\lambda\) and find the optimum \(\lambda\) value that minimizes test RSS
# load MASS library for lm.ridge function
library(MASS)
# we will run the models for 10 different values from 0 to 50
lambdas <- seq(0,50,len=10)
# create empty vectors for training and test residual sum squares
M <- length(lambdas)
train.rss <- rep(0,M); test.rss <- rep(0,M)
# create empty matrix for coefficients/betas
betas <- matrix(0,ncol(x),M)
# run ridge regressions on the predictors for all values of lambda
for(i in 1:M){
# create formula for the ridge regression (outcome ~ predictors)
Formula <-as.formula(paste("y~",paste(covnames,collapse="+"),sep=""))
# run ridge regression with the ith lambda value
fit1 <- lm.ridge(Formula,data=x,lambda=lambdas[i])
# store the coefficients of the ridge regression as a column in betas matrix
betas[,i] <- fit1$coef
# center the training data based on column means
scaledX <- sweep(as.matrix(x),2,fit1$xm)
# scale the training data based on standard deviations
scaledX <- sweep(scaledX,2,fit1$scale,"/")
# calculate predictions for training data using the model coefficients
yhat <- scaledX%*%fit1$coef+fit1$ym
# calculate residual sum squares for training data
train.rss[i] <- sum((y - yhat)^2)
# center the test data based on column means
scaledX <- sweep(as.matrix(x.test),2,fit1$xm)
# scale the test data based on standard deviations
scaledX <- sweep(scaledX,2,fit1$scale,"/")
# calculate predictions for test data using the model coefficients
yhat <- scaledX%*%fit1$coef+fit1$ym
# calculate residual sum squares for test data
test.rss[i] <- sum((y.test - yhat)^2)
}
# plot the line for test RSS vs lambda in red
plot(lambdas,test.rss,type="l",col="red",lwd=2,ylab="RSS",ylim=range(train.rss,test.rss))
# plot the line for training RSS vs lambda in blue
lines(lambdas,train.rss,col="blue",lwd=2,lty=2)
# find the best lambda that minimizes test RSS
best.lambda <- lambdas[which.min(test.rss)]
# plot vertical line marking the best lambda
abline(v=best.lambda)
# add text label
text(best.lambda+5, min(test.rss)-2, paste("best lambda=", round(best.lambda, 2)))
# add legend to plot
legend(30,30,c("Train","Test"),col=c("blue","red"),lty=c(2,1))
# plot all coefficients vs values of lambda
plot(lambdas,betas[1,],ylim=range(betas),type="n",ylab="Coefficients")
for(i in 1:ncol(x))
lines(lambdas,betas[i,],type="b",lty=i,pch=as.character(i), cex = 0.5)
# add horizontal line for reference
abline(h=0)
# add legend to plot
legend("topright",covnames,pch=as.character(1:8), cex = 0.5)
(lars package) lasso <- lars(as.matrix(x), y, type="lasso", trace=TRUE) = perform lasso regression by adding predictors one at a time (or setting some variables to 0)
    as.matrix(x) = the predictors must be in matrix/data frame format
    trace=TRUE = prints progress of the lasso regression
    lasso$lambda = returns the \(\lambda\)s used for each step of the lasso regression
    plot(lasso) = prints a plot that shows the progression of the coefficients as they are set to zero one by one
    predict.lars(lasso, test) = use the lasso model to predict on test data
        ?predict.lars = documentation
(lars package) cv.lars(as.matrix(x), y, K=10, type="lasso", trace=TRUE) = computes the K-fold cross-validated mean squared prediction error for lasso regression
    the lars function is run K times, once with each of the folds held out, to estimate the prediction error
    K=10 = create 10-fold cross validation
    trace=TRUE = prints progress of the lasso regression
(enet package) lasso <- enet(predictors, outcome, lambda = 0) = perform elastic net regression on the given predictors and outcome
    lambda=0 = default value for \(\lambda\)
        lambda=0 tells the function to fit a lasso regression
    plot(lasso) = prints a plot that shows the progression of the coefficients as they are set to zero one by one
    predict.enet(lasso, test) = use the lasso model to predict on test data
(caret package) train(outcome ~ predictors, data=training, method="lasso") = perform lasso regression with the given outcome and predictors
    preProcess=c("center", "scale") = centers and scales the predictors before the model is built
(caret package) train(outcome~predictors,data=train,method="relaxo",lambda=5,phi=0.3) = perform relaxed lasso regression on the given predictors and outcome
    lambda=5 = tuning parameter
    phi=0.3 = relaxation parameter
        phi=1 corresponds to the regular Lasso solutions
        phi=0 computes the OLS estimates on the set of variables selected by the Lasso
(caret package) predict(model,test) = use the model to predict on the test set \(\rightarrow\) similar to all other caret algorithms
Example (lars package)
# load lars package
library(lars)
# perform lasso regression
lasso.fit <- lars(as.matrix(x), y, type="lasso", trace=TRUE)
# plot lasso regression model
plot(lasso.fit, breaks=FALSE, cex = 0.75)
# add legend
legend("topleft", covnames, pch=8, lty=1:length(covnames),
col=1:length(covnames), cex = 0.6)# plots the cross validation curve
lasso.cv <- cv.lars(as.matrix(x), y, K=10, type="lasso", trace=TRUE)\[\begin{aligned} \mbox{majority vote accuracy} & = p(3~correct,~2~wrong) + p(4~correct,~1~wrong) \\ &\qquad+ p(5~correct) \\ & = {5 \choose 3} \times(0.7)^3(0.3)^2 + {5 \choose 4} \times(0.7)^4(0.3)^1 - {5 \choose 5} (0.7)^5 \\ & = 10 \times(0.7)^3(0.3)^2 + 5 \times(0.7)^4(0.3)^2 - 1 \times (0.7)^5 \\ & = 83.7% \\ \end{aligned}\]
library(ISLR); library(caret); library(ggplot2)
data(Wage)
Wage <- subset(Wage, select=-c(logwage))
# create a building data set and validation set
inBuild <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
validation <- Wage[-inBuild,]; buildData <- Wage[inBuild,]
inTrain <- createDataPartition(y=buildData$wage,p=0.7, list=FALSE)
training <- buildData[inTrain,]; testing <- buildData[-inTrain,]
# train the data using both glm and random forest models
glm.fit <- train(wage ~.,method="glm",data=training)
rf.fit <- train(wage ~.,method="rf",data=training,
trControl = trainControl(method="cv"),number=3)
# use the models to predict the results on the testing set
glm.pred.test <- predict(glm.fit,testing)
rf.pred.test <- predict(rf.fit,testing)
# plot the predicted values from the two models
qplot(glm.pred.test,rf.pred.test,colour=wage,data=testing)
# combine the prediction results and the true results into a new data frame
combinedTestData <- data.frame(glm.pred=glm.pred.test,
rf.pred = rf.pred.test,wage=testing$wage)
# run a Generalized Additive Model (gam) model on the combined test data
comb.fit <- train(wage ~.,method="gam",data=combinedTestData)
# use the resultant model to predict on the test set
comb.pred.test <- predict(comb.fit, combinedTestData)
# use the glm and rf models to predict results on the validation data set
glm.pred.val <- predict(glm.fit,validation)
rf.pred.val <- predict(rf.fit,validation)
# combine the results into data frame for the comb.fit
combinedValData <- data.frame(glm.pred=glm.pred.val,rf.pred=rf.pred.val)
# run the comb.fit on the combined validation data
comb.pred.val <- predict(comb.fit,combinedValData)
# tabulate the results - test data set RMSE Errors
rbind(test = c(glm = sqrt(mean((glm.pred.test-testing$wage)^2)),
rf = sqrt(mean((rf.pred.test-testing$wage)^2)),
combined = sqrt(mean((comb.pred.test-testing$wage)^2))),
# validation data set RMSE Errors
validation = c(sqrt(mean((glm.pred.val-validation$wage)^2)),
sqrt(mean((rf.pred.val-validation$wage)^2)),
sqrt(mean((comb.pred.val-validation$wage)^2))))## glm rf combined
## test 34.26616 35.58763 33.90359
## validation 35.40900 36.42254 35.25649
Note: a more detailed tutorial can be found in Rob Hyndman's Forecasting: principles and practice
Typical workflow: choose a smoothing approach (moving average ma vs exponential smoothing ets, see below) and apply the corresponding functions to the training data, produce predictions with the forecast function, and evaluate them with the accuracy function
simple moving average = the prediction for a time point is made by averaging together values from the surrounding periods \[ Y_{t}=\frac{1}{2k+1}\sum_{j=-k}^k {y_{t+j}}\]
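The moving average formula can be implemented directly in base R as a sanity check; a minimal sketch (the centered window leaves the first and last k values undefined):

```r
# centered simple moving average with window 2k+1
sma <- function(y, k) {
  n <- length(y)
  out <- rep(NA_real_, n)
  # average the k values on either side of t together with y[t]
  for (t in (k + 1):(n - k)) out[t] <- mean(y[(t - k):(t + k)])
  out
}
sma(c(1, 2, 3, 4, 5), k = 1)   # NA 2 3 4 NA
```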
the quantmod package can be used to pull trading/price information for publicly traded stocks
getSymbols("TICKER", src="google", from=date, to=date) = gets the daily high/low/open/close price and volume information for the specified stock ticker
    "TICKER" = ticker of the stock you are attempting to pull information for
    src="google" = get price/volume information from Google Finance
    from and to = from and to dates for the price/volume information
        both must be date objects
    more options for getSymbols can be found in the documentation ?getSymbols
to.monthly(GOOG) = converts stock data to a monthly time series from daily data
    GOOG = data frame returned from the getSymbols function
    ?to.period contains documentation for converting time series to OHLC (open high low close) series
googOpen <- Op(GOOG) = returns the opening price from the stock data frame
    Cl(), Hi(), Lo() = return the close, high and low price from the stock data frame
ts(googOpen, frequency=12) = convert data to a time series with frequency observations per time unit
    frequency=12 = number of observations per unit time (12 in this case because there are 12 months in each year \(\rightarrow\) converts data into a yearly time series)
decompose(ts) = decomposes the time series into trend, seasonal, and irregular components by using moving averages
    ts = time series object
window(ts, start=1, end=6) = subsets the time series at the specified starting and ending points
    start and end arguments must correspond to the time unit rather than the index
        if ts is a yearly series (frequency = 12), start/end should correspond to the row numbers or years (each year has 12 observations corresponding to the months)
        c(1, 7) can be used to specify the element of a particular year (in this case, July of the first year/row)
        a decimal can be passed to start/end, and the closest element (June of the 9th year in this case) will be used
        end=9-0.01 can be used as a shortcut to specify "up to 9", since end=9 will include the first element of the 9th row
the forecast package can be used for forecasting time series data
ma(ts, order=3) = calculates the simple moving average for the order specified
    order=3 = order of the moving average smoother, effectively the number of values that should be used to calculate the moving average
ets(train, model="MMM") = runs an exponential smoothing model on training data
    model = "MMM" = method used to create exponential smoothing
        the available models are documented in ?ets and the corresponding model chart is here
forecast(ts) = performs a forecast on the specified time series and returns 5 columns: forecast values, high/low 80 percent confidence interval bounds, high/low 95 percent interval bounds
    plot(forecast) = plots the forecast object, which includes the training data, forecast values for test periods, as well as the 80 and 95 percent confidence interval regions
accuracy(forecast, testData) = returns the accuracy metrics (RMSE, etc.) for the forecast model
Note: the quandl package is also used for finance-related predictions
# load quantmod package
library(quantmod);
# specify to and from dates
from.dat <- as.Date("01/01/00", format="%m/%d/%y")
to.dat <- as.Date("3/2/15", format="%m/%d/%y")
# get data for AAPL from Google Finance for the specified dates
getSymbols("AAPL", src="google", from = from.dat, to = to.dat)## [1] "AAPL"
# convert the retrieved daily data to monthly data
mAAPL <- to.monthly(AAPL)
# extract the closing price and convert it to yearly time series (12 observations per year)
ts <- ts(Cl(mAAPL), frequency = 12)
# plot the decomposed parts of the time series
plot(decompose(ts),xlab="Years")
# load forecast library
library(forecast)
# find the number of rows (years)
rows <- ceiling(length(ts)/12)
# use 90% of the data to create training set
ts.train <- window(ts, start = 1, end = floor(rows*.9)-0.01)
# use the rest of data to create test set
ts.test <- window(ts, start = floor(rows*.9))
# show time series on the training set
ts.train
##    Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 1 3.71 4.09 4.85 4.43 3.00 3.74 3.63 4.35 1.84 1.40 1.18 1.06
## 2 1.54 1.30 1.58 1.82 1.42 1.66 1.34 1.33 1.11 1.25 1.52 1.56
## 3 1.77 1.55 1.69 1.73 1.66 1.27 1.09 1.05 1.04 1.15 1.11 1.02
## 4 1.03 1.07 1.01 1.02 1.28 1.36 1.51 1.62 1.48 1.64 1.49 1.53
## 5 1.61 1.71 1.93 1.84 2.00 2.32 2.31 2.46 2.77 3.74 4.79 4.58
## 6 5.49 6.41 5.95 5.15 5.68 5.26 6.09 6.70 7.66 8.23 9.69 10.27
## 7 10.79 9.78 8.96 10.06 8.54 8.18 9.71 9.69 11.00 11.58 13.09 12.12
## 8 12.25 12.09 13.27 14.26 17.31 17.43 18.82 19.78 21.92 27.14 26.03 28.30
## 9 19.34 17.86 20.50 24.85 26.96 23.92 22.71 24.22 16.24 15.37 13.24 12.19
## 10 12.88 12.76 15.02 17.98 19.40 20.35 23.34 24.03 26.48 26.93 28.56 30.10
## 11 27.44 29.23 33.57 37.30 36.70 35.93 36.75 34.73 40.54 43.00 44.45 46.08
## 12 48.47 50.46 49.79 50.02 49.69 47.95 55.78 54.98 54.47 57.83 54.60 57.86
## 13 65.21 77.49 85.65 83.43 82.53 83.43 87.25 95.03 95.30 85.05 83.61 76.02
# plot the training set
plot(ts.train)
# add the moving average in red
lines(ma(ts.train,order=3),col="red")
# compute the exponential smoothing average
ets <- ets(ts.train,model="MMM")
# construct a forecasting model using the exponential smoothing function
fcast <- forecast(ets)
# plot forecast and add actual data in red
plot(fcast); lines(ts.test,col="red")
# print the accuracy of the forecast model
accuracy(fcast,ts.test)
##                     ME      RMSE       MAE        MPE     MAPE      MASE
## Training set 0.3738734 2.576067 1.438703 0.2377946 10.38942 0.1759433
## Test set 0.7292502 18.507031 15.347659 -3.6663982 19.01227 1.8769111
## ACF1 Theil's U
## Training set 0.1026281 NA
## Test set 0.8661382 3.200942
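The columns returned by forecast() can also be inspected directly on the forecast object. A minimal sketch, assuming the forecast package is installed, using the built-in AirPassengers series rather than the AAPL data:

```r
library(forecast)
# fit exponential smoothing on all but the final year of AirPassengers
fit <- ets(window(AirPassengers, end = 1960 - 0.01), model = "MMM")
fc <- forecast(fit)
head(fc$mean)   # point forecasts
head(fc$lower)  # lower bounds: one column for 80%, one for 95%
head(fc$upper)  # upper bounds: one column for 80%, one for 95%
```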
- kmeans(data, centers=3) = performs k-means clustering on the provided data
  - centers=3 = controls the number of clusters the algorithm should divide the data into
- the cl_predict function in the clue package provides similar functionality for predicting the cluster membership of new data
library(ggplot2)
# load gridExtra for grid.arrange used below
library(gridExtra)
# load iris data
data(iris)
# create training and test sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]; testing <- iris[-inTrain,]
# perform k-means clustering for the data without the Species information
# Species = what the true clusters are
kMeans1 <- kmeans(subset(training,select=-c(Species)),centers=3)
# add clusters as new variable to training set
training$clusters <- as.factor(kMeans1$cluster)
# plot clusters vs Species classification
p1 <- qplot(Petal.Width,Petal.Length,colour=clusters,data=training) +
ggtitle("Clusters Classification")
p2 <- qplot(Petal.Width,Petal.Length,colour=Species,data=training) +
ggtitle("Species Classification (Truth)")
grid.arrange(p1, p2, ncol = 2)
# tabulate the results from clustering and actual species
table(kMeans1$cluster,training$Species)
##
## setosa versicolor virginica
## 1 0 33 4
## 2 0 2 31
## 3 35 0 0
# build classification trees using the k-means clusters
clustering <- train(clusters ~ ., data=subset(training,select=-c(Species)), method="rpart")
# tabulate the prediction results on training set vs truth
table(predict(clustering,training),training$Species)
##
## setosa versicolor virginica
## 1 0 31 1
## 2 0 4 34
## 3 35 0 0
# tabulate the prediction results on test set vs truth
table(predict(clustering,testing),testing$Species)
##
## setosa versicolor virginica
## 1 0 13 0
## 2 0 2 15
## 3 15 0 0
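The cl_predict functionality mentioned above can be sketched in base R by assigning each new observation to the nearest fitted k-means center; the nearest() helper below is a hypothetical name for illustration, not part of the clue package:

```r
set.seed(125)
km <- kmeans(iris[, 1:4], centers = 3)
# equivalent in spirit to clue::cl_predict(km, newdata = ...):
# each new row gets the label of its closest cluster center
nearest <- function(newdata, centers) {
  apply(newdata, 1, function(row) which.min(colSums((t(centers) - row)^2)))
}
nearest(iris[1:5, 1:4], km$centers)  # same labels as km$cluster[1:5]
```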
doMC package
- many of the algorithms leveraged by the caret package are computationally intensive
- the doMC package is recommended for caret computations (reference)
- doMC::registerDoMC(cores=4) = registers 4 cores for R to utilize
- note: the doMC package is not available for Microsoft Windows versions of R
Parallel processing with caret using the parallel package
Parallelizing caret::train() is relatively straightforward, and includes the following steps:
Prerequisite: Selecting a Machine Learning Problem
This example uses the Sonar data set from the mlbench package.
library(mlbench)
data(Sonar)
library(caret)
set.seed(95014)
# create training & testing data sets
inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]
# set up training run for x / y syntax because model format performs poorly
x <- training[,-61]
y <- training[,61]
Step 1: Configure parallel processing
Parallel processing in caret can be accomplished with the parallel and doParallel packages. The following code loads the required libraries (note: these libraries also depend on the iterators and foreach libraries).
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
Step 2: Configure trainControl object
The most critical arguments for the trainControl object are the method, the number that specifies the quantity of folds for k-fold cross-validation, and allowParallel, which tells caret to use the cluster that we've registered in the previous step.
fitControl <- trainControl(method = "cv",
number = 10,
allowParallel = TRUE)
Step 3: Develop training model
Next, use caret::train() to train the model with the trainControl object that we just created. With the x/y syntax, no data argument is needed.
fit <- train(x, y, method="rf", trControl = fitControl)
Step 4: De-register parallel processing cluster
After processing the data, we shut down the cluster with the stopCluster() and registerDoSEQ() functions. The registerDoSEQ() call is required to force R to return to single-threaded processing.
stopCluster(cluster)
registerDoSEQ()
At this point we have a trained model in the fit object, and can take a number of steps to evaluate its suitability, including accuracy and a confusion matrix based on comparing the modeled data to the held-out folds.
fit
fit$resample
confusionMatrix.train(fit)
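The in-fold metrics above can be complemented by an out-of-sample check against the held-out testing set. A sketch that repeats the data split so it runs standalone, assuming the mlbench, caret, and randomForest packages are installed (5-fold CV is used here just to keep the run short):

```r
library(mlbench); library(caret)
data(Sonar)
set.seed(95014)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTraining, ]
testing  <- Sonar[-inTraining, ]
# random forest with 5-fold cross-validation on the training set
fit <- train(training[, -61], training[, 61], method = "rf",
             trControl = trainControl(method = "cv", number = 5))
# confusion matrix comparing test-set predictions to the truth
confusionMatrix(predict(fit, testing[, -61]), testing$Class)
```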